Abstract

In segmentation problems, inference on change-point position and model selection are two difficult 
issues due to the discrete nature of change-points. In a Bayesian context, we derive exact, non- 
asymptotic, explicit and tractable formulae for the posterior distribution of variables such as the 
number of change-points or their positions. We also derive a new selection criterion that accounts for 
the reliability of the results. All these results are based on an efficient strategy to explore the whole 
segmentation space, which is very large. We illustrate our methodology on both simulated data and 
a comparative genomic hybridisation profile. 
Short title: Posterior distribution over the segmentation space

1 Introduction 

Segmentation and change-point detection problems arise in many scientific domains such as econometrics,
climatology, agronomy or molecular biology. The general problem can be written as follows. It is assumed
that the observed data $\{y_t\}_{t=1,\dots,n}$ is a realization of an independent random process
$Y = \{Y_t\}_{t=1,\dots,n}$. This process is drawn from a probability distribution $G$, which depends on
a set of parameters denoted by $\theta$. These parameters are assumed to be affected by $K-1$ abrupt
changes, called change-points, at some unknown positions $\tau_2, \dots, \tau_K$ (with the convention
$\tau_1 = 1$ and $\tau_{K+1} = n+1$). Thus, the change-points delimit a partition $m$ of $\{1,\dots,n\}$,
called here a segmentation, into $K$ segments $r_k$ such that
$r_k = [\![\tau_k, \tau_{k+1}[\![ = \{\tau_k, \tau_k+1, \dots, \tau_{k+1}-1\}$ and $m = \{r_1, \dots, r_K\}$.

The segmentation model has the following general form for a given m: 

$$Y_t \sim G(\theta_r) \quad \text{if } t \in r \text{ and } r \in m,$$

where $\theta_r$ stands for the parameters of segment $r$. In this study, all the change-points are detected
simultaneously, a strategy called off-line detection (as opposed to on-line detection). With this strategy,
the question of finding the best segmentation in a given number of segments has already been largely
studied (see for example Lavielle (2005), Braun and Müller (2000), Bai and Perron (2003)). But two
important issues remain: assessing the quality of the proposed segmentation and selecting the number
of segments (also called dimension). In both cases, the main problem is the discrete nature of the
change-points, which prevents the use of routine statistical inference.
On the one hand, the quality of a given segmentation can be assessed by studying the uncertainty
of the change-point positions. From a non-asymptotic and non-parametric point of view, the
standard likelihood-based inference is very intricate, since the required regularity conditions for the
change-point parameters are not satisfied (Feder (1975)). Different methods to obtain change-point
confidence intervals have been proposed. Most of them are based on the limit distribution of the
change-point estimators (Feder (1975), Bai and Perron (2003)) or the asymptotic use of a likelihood-ratio
statistic (Muggeo (2003)). Other proposed confidence intervals are based on bootstrap techniques
(Huskova and Kirch (2008) and references therein). A practical comparison of these methods can be
found in Toms and Lesperance (2003).



On the other hand, choosing the number of segments is also a critical issue. This is usually done by
minimising a penalised contrast function and the problem is to find a good penalty. General penalised
criteria have been developed, such as AIC (Akaike (1973)) and BIC (Schwarz (1978)). In the segmentation
framework, these criteria are not adapted since an exponential model collection is considered
(Birgé and Massart (2007), Baraud et al. (2009)) and these criteria tend to overestimate the number
of segments (see for example Lavielle (2005)). Recently, some penalised criteria have been proposed
specifically for the segmentation framework. Some depend on constants to be calibrated (Lavielle (2005) and
Lebarbier (2005)), but others do not (Zhang and Siegmund (2007)). More precisely, Zhang and Siegmund (2007)
discussed the fact that the classical BIC was not theoretically justified in the segmentation context.
Indeed, the BIC criterion is derived from an asymptotic approximation of the posterior model probabilities
and requires the likelihood function to be three times differentiable with respect to the parameters of the
model (Kass and Raftery (1995), Lebarbier and Mary-Huard (2006)). As the change-points are discrete
parameters, the previous condition is not satisfied. A modified BIC criterion has thus been developed by
Zhang and Siegmund (2007) by considering a continuous-time version of the problem.



The purpose of our work is to provide exact, non-asymptotic, explicit and tractable formulae for both 
the posterior probability of a segmentation and that of a change-point occurring at a given position. 
More specifically, we consider the segmentation problem in a Bayesian framework so that the posterior 
probability of a segmentation is well defined. To tackle the discrete nature of change-points, we work 
at the segment level, where statistical inference is straightforward. From these segments, the issue is to 
get back to the segmentation or dimension level. Provided that the segments are independent, it will be 
necessary to calculate quantities such as: 



$$\sum_{m \in \mathcal{M}} P(Y|m)P(m) = \sum_{m \in \mathcal{M}} P(m) \prod_{r \in m} P(Y^r|r), \qquad (1)$$

where $Y^r$ stands for all observations in segment $r$ and $\mathcal{M}$ is usually a very large set of segmentations.
We propose a closed-form (in terms of matrix products) and tractable formulation of such quantities.
Some similar quantities were computed by Guédon (2008) in a non-Bayesian context, using a
forward-backward-like algorithm. However, this author computes all these quantities for fixed values of the
segment parameters, which are the maximum likelihood estimators. From our formula, we derive key
quantities to assess the quality of a segmentation and select the number of segments.

On the one hand, we obtain the exact formulae for both the posterior probability of a segmentation 
and that of a change-point occurring at a given position. This enables the construction of credibility 
intervals for change-points. Moreover, we retrieve the exact posterior probability of a segment within 
a given dimension, the exact entropy of the posterior distribution of the segmentations within a given 
dimension and the exact posterior mean of the signal. 

On the other hand, we derive a so-called 'exact' BIC criterion for choosing the number of segments,
taking $\mathcal{M} = \mathcal{M}_K$, which is the set of all possible segmentations with $K$ segments. In the same way,
we derive the ICL criterion of Biernacki et al. (2000) in the segmentation framework. This last criterion
takes into account the reliability of the results.

In Section 2, we give some exact formulae to explore the segmentation space and assess the quality of
a segmentation. In Section 3, we focus on the model selection problem: we derive an exact BIC criterion
and propose a new ICL criterion. In the last section, we illustrate our results first on Poisson simulated
data and second on comparative genomic hybridization (CGH) data in a Gaussian framework.






2 Exploring the segmentation space 



A naive computation of (1) is impossible when $\mathcal{M}$ is large, which is usually the case. For example, if
$\mathcal{M} = \mathcal{M}_K$, there are $\binom{n-1}{K-1}$ segmentations of $n$ data into $K$ segments. In this section we propose a
tractable and closed-form formula for (1). The following assumption enables us to derive an exact matrix
product formulation of (1), enabling its straightforward computation in $O(Kn^2)$ time.

Factorability assumption: A model satisfies the factorability assumption if

$$(H): \quad P(Y, m) = C \prod_{r \in m} a_r P(Y^r|r) \qquad (2)$$

where $P(Y^r|r) = \int P(Y^r|\theta_r)P(\theta_r)\,d\theta_r$. In the following, for the sake of clarity, we will simply denote it
$P(Y^r)$. This is true when all segment parameters are different but this is false, for example, for the
normal homoscedastic model $G(\theta_r) = \mathcal{N}(\mu_r, 1/\tau)$ with unknown precision $\tau$.

We denote by $\mathcal{M}_K([\![i,j[\![)$ the set of all possible segmentations of $[\![i,j[\![$ into $K$ segments. The simplified
notation $\mathcal{M}_K$ refers to $\mathcal{M}_K([\![1,n+1[\![)$.

Theorem 2.1 Consider a function $F$ such that, for all $k \in [\![1,K]\!]$ and for all segmentations
$m \in \mathcal{M}_k([\![1,j[\![)$ (for $1 < j \le n+1$), there exists a function $f$ such that $F(m) = \prod_{r \in m} f(r)$. Let $A$ be
a square matrix with $n+1$ columns such that

$$A_{ij} = f([\![i,j[\![) \ \text{ if } 1 \le i < j \le n+1, \qquad A_{ij} = 0 \ \text{ otherwise.}$$

Then,

$$\sum_{m \in \mathcal{M}_k([\![1,j[\![)} F(m) = (A^k)_{1,j}$$

and the $K \times (n+1)$ elements of

$$\Big\{ \sum_{m \in \mathcal{M}_k([\![1,j[\![)} F(m) \Big\}_{k \in [\![1,K]\!],\; j \in [\![1,n+1]\!]}$$

can all be computed in $O(Kn^2)$.

The proof is given in Appendix A.1. It is based on a linear algebra lemma. The lower triangular part of
matrix $A$ is set to 0 to fit the segmentation context. Note that, similarly, we have
$\sum_{m \in \mathcal{M}_k([\![i,j[\![)} F(m) = (A^k)_{i,j}$ for all $1 \le i < j \le n+1$. Theorem 2.1 will be used many times in the
following sections, using a specific function $f(r)$ for each quantity of interest.
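The matrix-product formulation of Theorem 2.1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names `segment_matrix`, `sums_over_segmentations` and `brute_force_sum` are hypothetical, NumPy is assumed to be available, and the computation is done in the linear domain, so it is suitable for short series only. The brute-force enumeration is included purely to check the result on small examples.

```python
import itertools
import numpy as np

def segment_matrix(f, n):
    """Build the (n+1) x (n+1) matrix A of Theorem 2.1.

    Positions are 1-based as in the text: A[i-1, j-1] = f(i, j) for the
    segment [i, j[ = {i, ..., j-1} with i < j, and 0 elsewhere.
    """
    A = np.zeros((n + 1, n + 1))
    for i in range(1, n + 1):
        for j in range(i + 1, n + 2):
            A[i - 1, j - 1] = f(i, j)
    return A

def sums_over_segmentations(A, K):
    """Return S with S[k-1, j-1] = (A^k)_{1,j} for k = 1..K.

    Only the first row of each power is propagated, so every step is a
    vector-matrix product in O(n^2), i.e. O(K n^2) overall.
    """
    row = A[0].copy()
    rows = [row]
    for _ in range(K - 1):
        row = row @ A
        rows.append(row)
    return np.array(rows)

def brute_force_sum(f, n, K):
    """Sum of prod_r f(r) over all segmentations of {1..n} into K segments."""
    total = 0.0
    for cps in itertools.combinations(range(2, n + 1), K - 1):
        bounds = (1,) + cps + (n + 1,)
        total += np.prod([f(a, b) for a, b in zip(bounds[:-1], bounds[1:])])
    return total
```

With `f` identically equal to 1, entry `S[K-1, n]` counts the segmentations of n points into K segments, which gives a quick sanity check against the binomial coefficient quoted above.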

2.1 Joint distribution of the data and the segmentation or the dimension 

$P(Y, m)$ and $P(Y, K)$ are key ingredients to calculate various quantities of interest, such as (1). To
calculate $P(Y, m)$ and

$$P(Y, K) = \sum_{m \in \mathcal{M}_K} P(Y, m), \qquad (3)$$

we first need to define priors for the segmentation $m$. We consider here two typical priors.

Uniform conditional on the dimension: For any prior on the dimension $P(K)$, we define a uniform
prior distribution for $m$ given its dimension $K$:

$$P(m|K(m)) = \binom{n-1}{K(m)-1}^{-1} \;\Rightarrow\; P(m) = P(K(m)) \Big/ \binom{n-1}{K(m)-1}, \qquad (4)$$

that is $a_r = 1$ in (2), denoting $K(m)$ the number of segments (i.e. the dimension) of the segmentation
$m$.

Homogeneous segment lengths: Segmentations with balanced segment lengths are sometimes desirable.
They are favoured by the following prior:

$$P(m) = C \prod_{r \in m} n_r^{-1}, \quad \text{where } C \text{ ensures that } \sum_{m \in \mathcal{M}} P(m) = 1, \qquad (5)$$

that is $a_r = n_r^{-1}$ in (2), where $n_r$ denotes the length of segment $r$ and $\mathcal{M}$ the set of all considered
segmentations. In this case, the prior distribution of $m$ is directly defined and the prior distribution
of the dimension $P(K)$ is not explicit. Determining the constant $C$ requires summing over all
possible segmentations. This sum can be handled using the properties given below.

Proposition 2.2 Under assumption (H), for prior distributions (4) and (5), $P(Y, K)$ can be computed
in $O(Kn^2)$ as $P(Y, K) = C (A^K)_{1,n+1}$ with $A_{ij} = 0$ for $j \le i$ and, for $j > i$, for prior distribution (4):

$$A_{ij} = P(Y^{[\![i,j[\![}) \quad \text{and} \quad C^{-1} = \binom{n-1}{K-1};$$

and for prior distribution (5):

$$A_{ij} = n_{[\![i,j[\![}^{-1}\, P(Y^{[\![i,j[\![}) \quad \text{and} \quad C^{-1} = \sum_{m \in \mathcal{M}_K} \prod_{r \in m} n_r^{-1}.$$

Proof. For prior distribution (4), we use Theorem 2.1 with $f(r) = P(Y^r)$, implying
$A_{ij} = f([\![i,j[\![) = P(Y^{[\![i,j[\![})$.
For prior distribution (5), we first retrieve $C$ using Theorem 2.1 with $f(r) = n_r^{-1}$. The result follows, using
Theorem 2.1 again, taking $f(r) = n_r^{-1} P(Y^r)$. ■

The preceding results require the calculation of $P(Y^r)$. Hence, $n(n-1)/2$ integrals need to be
evaluated, corresponding to each possible segment. For general priors, they can be evaluated numerically
or via any stochastic algorithm. A closed form can be obtained if conjugate priors are used.

Poisson and Gaussian models. We recall classical results for two models that will be used later.
First is the segmentation problem of a piecewise constant Poisson rate model:

$$\{\mu_r\} \text{ i.i.d.}, \ \mu_r \sim \mathrm{Gam}(\alpha_r, \beta_r); \qquad \{Y_t\} \text{ independent}, \ Y_t \sim \mathcal{P}(\mu_r) \text{ if } t \in r. \qquad (6)$$

Second is the segmentation of a Gaussian signal where both the mean and the variance are affected by
the change-points:

$$\{\tau_r\} \text{ i.i.d.}, \ \tau_r \sim \mathrm{Gam}(\nu_0/2, 2/s_0); \qquad \{\mu_r\} \text{ independent}, \ \mu_r|\tau_r \sim \mathcal{N}(\mu_0, (n_0\tau_r)^{-1});$$

$$\{Y_t\} \text{ independent}, \ Y_t \sim \mathcal{N}(\mu_r, 1/\tau_r) \text{ if } t \in r. \qquad (7)$$

For the Poisson model, we get 



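For the Gaussian heteroscedastic model, we get

$$P(Y^r) = \frac{n_0^{1/2}\,(s_0/2)^{\nu_0/2}\,\Gamma\big((\nu_0+n_r)/2\big)}{(2\pi)^{n_r/2}\,\Gamma(\nu_0/2)\,\sqrt{n_r+n_0}}\; s_r^{(\nu_0+n_r)/2} \qquad (8)$$

where $s_r = 2\left(n_r S_r^2 + s_0 + \frac{n_r n_0 (\bar{y}_r - \mu_0)^2}{n_r + n_0}\right)^{-1}$, $S_r^2 = \frac{1}{n_r}\sum_{t \in r}(Y_t - \bar{y}_r)^2$ and $\bar{y}_r$ is the empirical mean of the
signal within segment $r$.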
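For the Poisson model (6), the segment marginal $P(Y^r)$ follows from standard Gamma-Poisson conjugacy. The sketch below evaluates its logarithm under the assumption that $\mathrm{Gam}(\alpha, \beta)$ is parameterised by shape $\alpha$ and rate $\beta$, which the text does not state explicitly; under a shape/scale convention, `beta` should be replaced by `1 / beta`. The function name is hypothetical.

```python
import math

def poisson_gamma_log_marginal(y, alpha, beta):
    """log P(Y^r) for counts y in one segment, with mu_r ~ Gamma(shape=alpha, rate=beta).

    Integrating the Poisson likelihood against the Gamma prior gives
      beta^alpha / Gamma(alpha) * Gamma(alpha + s) / (beta + n_r)^(alpha + s) / prod(y_t!)
    with s = sum(y) and n_r = len(y). The rate parameterisation is an assumption.
    """
    s = sum(y)
    n_r = len(y)
    return (alpha * math.log(beta) - math.lgamma(alpha)
            + math.lgamma(alpha + s)
            - (alpha + s) * math.log(beta + n_r)
            - sum(math.lgamma(v + 1) for v in y))
```

Working with logarithms avoids underflow when the segment is long; the values can be exponentiated for short segments or combined with a log-sum-exp recursion.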
2.2 Posterior distribution of the change-points and segments

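We now give explicit formulae for the posterior distribution of change-points and segments. We first
define the corresponding segmentation subsets:

$\mathcal{B}_{K,k}(t)$ is the subset of segmentations from $\mathcal{M}_K$ such that the $k$-th segment starts at position $t$, i.e. that
the $(k-1)$-th change-point is at $t$:

$$\mathcal{B}_{K,k}(t) = \{m \in \mathcal{M}_K : \tau_k = t\};$$

$\mathcal{B}_K(t)$ is the subset of segmentations having a change-point at position $t$:

$$\mathcal{B}_K(t) = \bigcup_k \mathcal{B}_{K,k}(t);$$

$\mathcal{S}_{K,k}([\![t_1,t_2[\![)$ is the subset of segmentations having segment $r = [\![t_1,t_2[\![$ as their $k$-th segment:

$$\mathcal{S}_{K,k}([\![t_1,t_2[\![) = \{m \in \mathcal{M}_K([\![1,n+1[\![) : \tau_k = t_1, \tau_{k+1} = t_2\};$$

$\mathcal{S}_K([\![t_1,t_2[\![)$ is the subset of segmentations including segment $[\![t_1,t_2[\![$:

$$\mathcal{S}_K([\![t_1,t_2[\![) = \bigcup_k \mathcal{S}_{K,k}([\![t_1,t_2[\![).$$

We denote the conditional probability given the data $Y$ and the dimension $K$ of each of these subsets by
the corresponding capital letters with same indices, e.g.

$$B_{K,k}(t) = \Pr\{m \in \mathcal{B}_{K,k}(t) \mid Y, K\}.$$

$B_K(t)$, $S_{K,k}([\![t_1,t_2[\![)$ and $S_K([\![t_1,t_2[\![)$ are defined similarly. The following proposition gives explicit formulae for
these probabilities.

Proposition 2.3 For all $[\![t_1,t_2[\![$ such that $t_1 < t_2$, we define, for $K \ge 1$,

$$F_{t_1,t_2}(K) = \sum_{m \in \mathcal{M}_K([\![t_1,t_2[\![)} P(Y^{[\![t_1,t_2[\![} \mid m)\, P(m \mid K)$$

and we set $F_{t_1,t_2}(K) = 0$ if $t_1 \ge t_2$. Under assumption (H), the probabilities $B_{K,k}(t)$, $B_K(t)$, $S_{K,k}([\![t_1,t_2[\![)$
and $S_K([\![t_1,t_2[\![)$ are

$$B_{K,k}(t) = \frac{F_{1,t}(k-1)\,F_{t,n+1}(K-k+1)}{P(Y|K)},$$

$$S_{K,k}([\![t_1,t_2[\![) = \frac{F_{1,t_1}(k-1)\,F_{t_1,t_2}(1)\,F_{t_2,n+1}(K-k)}{P(Y|K)},$$

$$B_K(t) = \sum_{k=1}^{K} B_{K,k}(t) \quad \text{and} \quad S_K([\![t_1,t_2[\![) = \sum_k S_{K,k}([\![t_1,t_2[\![).$$

The proof is given in Appendix A.2. It is mainly based on set decompositions, such as

$$\mathcal{B}_{K,k}(t) = \mathcal{M}_{k-1}([\![1,t[\![) \times \mathcal{M}_{K-k+1}([\![t,n+1[\![) \qquad (9)$$

and all sums over $\mathcal{M}_{k-1}([\![1,t[\![)$ and $\mathcal{M}_{K-k+1}([\![t,n+1[\![)$ can be obtained with Theorem 2.1.

$\{B_{K,k}(t)\}_t$ provides the exact posterior distribution of the starting point of the $k$-th segment, given
dimension $K$. From that, we get the exact credibility of interval $[\![t_1,t_2]\!]$ for change-point $\tau_k$:

$$C_{K,k}([\![t_1,t_2]\!]) = \Pr\{\tau_k \in [\![t_1,t_2]\!] \mid Y, K\} = \sum_{t=t_1}^{t_2} B_{K,k}(t).$$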
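The posterior distribution of each change-point can be obtained from two sweeps of matrix-vector products. The sketch below is a hypothetical illustration, not the authors' code: it assumes that $P(m|Y,K)$ is proportional to the product of the segment weights stored in the strictly upper-triangular matrix `A` (as in Theorem 2.1), so that constant prior factors cancel in the ratio. It works in the linear domain and uses 0-based array indices, so column `t - 1` corresponds to position `t`.

```python
import numpy as np

def changepoint_posteriors(A, K):
    """Return B with B[k-1, t-1] = P(tau_k = t | Y, K), k = 1..K, t = 1..n+1.

    Assumes P(m | Y, K) is proportional to the product of A over the
    segments of m, where A is the (n+1) x (n+1) segment matrix.
    """
    size = A.shape[0]
    # forward[k] is the first row of A^k (A^0 is the identity)
    forward = [np.eye(size)[0]]
    # backward[k] is the last column of A^k
    backward = [np.eye(size)[:, -1]]
    for _ in range(K):
        forward.append(forward[-1] @ A)
        backward.append(A @ backward[-1])
    total = forward[K][-1]  # (A^K)_{1,n+1}, the normalising constant
    return np.array([forward[k - 1] * backward[K - k + 1] / total
                     for k in range(1, K + 1)])
```

The credibility of an interval $[\![t_1,t_2]\!]$ for the start of the $k$-th segment is then `B[k - 1, t1 - 1:t2].sum()`. The first row of `B` is degenerate because $\tau_1 = 1$ by convention.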
2.3 Retrieving the mean signal

In many applications, the mean value $\mu_t$ of the signal at a given position can also provide some insight
about the phenomenon under study. This mean signal can be retrieved via model averaging over the
segmentation space. The posterior mean of the signal is

$$s_K(t) = \sum_{m \in \mathcal{M}_K} P(m|Y,K)\, s_m(t), \qquad (10)$$

where $s_m(t) = E[\mu_t \mid m, Y]$.

Proposition 2.4 The posterior mean of the signal given the dimension is

$$s_K(t) = \sum_{r \ni t} S_K(r)\, \hat\mu_r$$

where $\hat\mu_r = E[\mu_r \mid Y^r]$. Under assumption (H), it can be computed with a quadratic complexity.

Proof. If a segment $r$ belongs to a segmentation $m$ and if position $t$ lies in segment $r$ then $s_m(t) = \hat\mu_r$.
The rest of the formula is straightforward. Assumption (H) ensures that the $S_K(r)$ can be computed in
$O(Kn^2)$. ■
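A possible way to organise the model-averaging computation is sketched below. It is a hypothetical illustration under stated assumptions: `A` holds the segment weights of Theorem 2.1 in its strict upper triangle, `mu_hat[t1, t2]` holds the posterior mean of the segment starting at index `t1` and ending before index `t2` (0-based, finite everywhere), and the posterior over segmentations is proportional to the product of the weights. A difference array keeps the final accumulation quadratic in `n`.

```python
import numpy as np

def segment_posteriors(A, K):
    """S[t1, t2] = posterior probability that segment [t1, t2[ belongs to m."""
    size = A.shape[0]
    fwd = [np.eye(size)[0]]        # first rows of A^0, A^1, ...
    bwd = [np.eye(size)[:, -1]]    # last columns of A^0, A^1, ...
    for _ in range(K):
        fwd.append(fwd[-1] @ A)
        bwd.append(A @ bwd[-1])
    S = sum(np.outer(fwd[k - 1], bwd[K - k]) * A for k in range(1, K + 1))
    return S / fwd[K][-1]

def posterior_mean_signal(A, mu_hat, K):
    """Posterior mean of the signal at each of the n positions."""
    w = segment_posteriors(A, K) * mu_hat
    # each segment [t1, t2[ adds w[t1, t2] from index t1 and removes it at t2
    diff = w.sum(axis=1) - w.sum(axis=0)
    return np.cumsum(diff)[:-1]
```

When every segment has the same posterior mean, the averaged signal is that constant, since each position lies in exactly one segment of every segmentation.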

2.4 Posterior entropy 

Segmentation problems are often reduced to choosing $\hat{m}(K)$, the best segmentation (i.e. the one with
maximal posterior probability) with dimension $K$. The other segmentations with dimension $K$ are rarely
considered. The entropy of the distribution $P(m|Y,K)$

$$\mathcal{H}(K) = -\sum_{m \in \mathcal{M}_K} P(m|Y,K) \log P(m|Y,K)$$

measures how the posterior distribution is concentrated around the best segmentation. Intuitively, a
small entropy $\mathcal{H}(K)$ means that the best segmentation is a much better fit to the data than any other
segmentation. We use this information in Section 3 for model selection.

Proposition 2.5 Under assumption (H), the posterior entropy $\mathcal{H}(K)$ is

$$\mathcal{H}(K) = -\sum_{r} S_K(r) \log f(r) + \log A_K$$

where $f(r) = a_r P(Y^r)$ and $A_K = \sum_{m \in \mathcal{M}_K} \prod_{r \in m} f(r)$, which can be computed using Proposition 2.2.

Proof. Since all distributions can be factorized, we have

$$\mathcal{H}(K) = -\sum_{m \in \mathcal{M}_K} P(m|Y,K) \sum_{r \in m} \log f(r) + \log A_K \sum_{m \in \mathcal{M}_K} P(m|Y,K)$$

$$= -\sum_{r} \log f(r) \sum_{m \in \mathcal{M}_K,\, m \ni r} P(m|Y,K) + \log A_K \sum_{m \in \mathcal{M}_K} P(m|Y,K)$$

and the result follows. ■
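The entropy formula can be checked numerically on a small example. The sketch below is hypothetical and makes the same assumptions as Theorem 2.1: `A[t1, t2]` stores the weight $f(r)$ of the segment starting at 0-based index `t1` and ending before `t2`, and is zero outside the strict upper triangle. `brute_force_entropy` enumerates all segmentations and is only meant for validation on short series.

```python
import itertools
import math
import numpy as np

def posterior_entropy(A, K):
    """Entropy of P(m | Y, K) as -sum_r S_K(r) log f(r) + log A_K."""
    size = A.shape[0]
    fwd = [np.eye(size)[0]]
    bwd = [np.eye(size)[:, -1]]
    for _ in range(K):
        fwd.append(fwd[-1] @ A)
        bwd.append(A @ bwd[-1])
    A_K = fwd[K][-1]                       # sum over M_K of prod_r f(r)
    S = np.zeros_like(A, dtype=float)      # S[t1, t2] = S_K([t1, t2[)
    for k in range(1, K + 1):
        S += np.outer(fwd[k - 1], bwd[K - k]) * A
    S /= A_K
    mask = A > 0
    return float(-np.sum(S[mask] * np.log(A[mask])) + math.log(A_K))

def brute_force_entropy(A, K):
    """Direct evaluation of -sum_m p_m log p_m by enumeration (small n only)."""
    n = A.shape[0] - 1
    weights = []
    for cps in itertools.combinations(range(1, n), K - 1):
        b = (0,) + cps + (n,)
        weights.append(math.prod(A[a, c] for a, c in zip(b[:-1], b[1:])))
    p = np.array(weights) / sum(weights)
    return float(-np.sum(p * np.log(p)))
```

With all weights equal to 1 the posterior is uniform over the segmentations with K segments, and the entropy reduces to the logarithm of their number.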
3 Model selection



In a Bayesian framework, the BIC criterion aims to choose the model which maximises $P(M|Y)$, where $M$
is the model. To calculate the BIC criterion, one needs to know $P(Y|M) = \int P(Y|\theta_M, M)P(\theta_M|M)\,d\theta_M$,
where $\theta_M$ is the set of parameters of the model $M$. Similar quantities are involved in the Bayes factor
for model comparison (Kass and Raftery (1995)).

In our case, the word 'model' is too broad and we have to distinguish between the selection of the
dimension $K$ and the selection of the segmentation $m$. When considering the choice of $K$, a direct
application of the Laplace approximation is not theoretically justified to calculate the previous integral because
the required differentiability condition is not satisfied for change-points (Zhang and Siegmund (2007)).
However, we can bypass the problem by working at the segment level and then going back to the
dimension level using Proposition 2.2. Thus, the derivation of BIC criteria only requires the calculation
of $P(Y^r) = \int P(Y^r|\theta_r)P(\theta_r)\,d\theta_r$, which can be obtained in a closed form for the Poisson model and the
heteroscedastic Gaussian model as shown in Section 2.1. Moreover, we derive an adaptation of the ICL
criterion, first proposed for mixture models, to the segmentation context (Biernacki et al. (2000)).



3.1 Exact BIC criterion for dimension and segmentation selection 

Choice of the dimension. In segmentation problems, the selection of the 'best' number of segments
$K$ can be addressed per se, or as a first step toward the selection of the 'best' segmentation. The Bayesian
framework suggests choosing

$$\hat{K} = \arg\min_K \mathrm{BIC}(K), \quad \text{where } \mathrm{BIC}(K) = -\log P(Y, K). \qquad (11)$$

$\mathrm{BIC}(K)$ can be computed in quadratic time, using Proposition 2.2.
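To make the computation of the exact BIC concrete, here is a minimal log-domain sketch for count data. It is an illustration under several assumptions that are not taken from the text: a Gamma-Poisson segment marginal with shape/rate parameterisation, a uniform prior $P(K) = 1/K_{\max}$ on the dimension, a uniform prior on segmentations given $K$, and `K_max` not larger than the series length. All function names are hypothetical.

```python
import math
import numpy as np

def log_marginal(y, alpha=1.0, beta=1.0):
    """Gamma-Poisson log P(Y^r); shape/rate parameterisation is assumed."""
    s, n_r = sum(y), len(y)
    return (alpha * math.log(beta) - math.lgamma(alpha) + math.lgamma(alpha + s)
            - (alpha + s) * math.log(beta + n_r)
            - sum(math.lgamma(v + 1) for v in y))

def log_segment_matrix(y):
    """logA[i, j] = log P(Y^r) for the segment y[i:j]; -inf where j <= i."""
    n = len(y)
    logA = np.full((n + 1, n + 1), -np.inf)
    for i in range(n):
        for j in range(i + 1, n + 1):
            logA[i, j] = log_marginal(y[i:j])
    return logA

def exact_bic(y, K_max):
    """Return [BIC(1), ..., BIC(K_max)] with BIC(K) = -log P(Y, K)."""
    n = len(y)
    logA = log_segment_matrix(y)
    row = logA[0].copy()               # log of the first row of A^1
    bic = []
    for K in range(1, K_max + 1):
        # assumed priors: P(K) = 1/K_max and P(m | K) uniform over M_K
        log_prior = -math.log(K_max) - math.log(math.comb(n - 1, K - 1))
        bic.append(-(log_prior + row[n]))
        # log-domain vector-matrix product: first row of A^(K+1)
        row = np.logaddexp.reduce(row[:, None] + logA, axis=0)
    return bic
```

Each update of `row` costs $O(n^2)$, so the loop over dimensions is quadratic in `n` for a fixed `K_max`; filling `logA` with this naive marginal is more expensive and could be replaced by cumulative sums.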

Choice of the segmentation. The best segmentation can be chosen in two ways.

Two-step strategy: The 'best' segmentation $m$ can be chosen, conditionally on the pre-selected dimension
$K$, as

$$\hat{m}(K) = \arg\min_{m \in \mathcal{M}_K} \mathrm{BIC}(m|K), \quad \text{where } \mathrm{BIC}(m|K) = -\log P(Y, m|K). \qquad (12)$$

One-step strategy: The 'best' segmentation $m$ can also be directly chosen among a larger collection
$\mathcal{M} = \bigcup_k \mathcal{M}_k$ as

$$\hat{m} = \arg\min_{m \in \mathcal{M}} \mathrm{BIC}(m), \quad \text{where } \mathrm{BIC}(m) = -\log P(Y, m). \qquad (13)$$

Both $\mathrm{BIC}(m|K)$ and $\mathrm{BIC}(m)$ can be computed efficiently thanks to Proposition
3.2 ICL criterion for dimension selection 

In the framework of incomplete data models (e.g. mixture models), Biernacki et al. (2000) suggest using
the criterion $\mathrm{ICL}(M)$, which is an estimate of $E[\log P(Y, Z, M)|Y]$ where $Z$ stands for the unobserved
variables. Based on the equation

$$E[\log P(Y, Z|M)|Y] = \log P(Y|M) + E[\log P(Z|Y, M)|Y],$$

they argue that the entropy $\mathcal{H}(Z|Y, M) = -E[\log P(Z|Y, M)|Y]$ is an intrinsic penalty term. The ICL
criterion will tend to select models that provide a reliable prediction of $Z$, i.e. with a small entropy. This
may be desirable, for example in the classification context.

In the segmentation context, the segmentation $m$ can be considered as an unobserved variable. The
dimension $K$ can then be chosen according to the ICL as

$$\hat{K} = \arg\min_K \mathrm{ICL}(K), \quad \text{where } \mathrm{ICL}(K) = -\log P(Y, K) + \mathcal{H}(m|Y, K).$$

We expect ICL to favour the dimension $K$ where the best segmentation $\hat{m}(K)$
clearly outperforms the other segmentations in $K$ segments, so that $\hat{m}(K)$ is more reliable.
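Once $\log P(Y,K)$ and the entropy are available for each candidate dimension, the two criteria differ only by the entropy term. The short sketch below shows the selection step; the function name and the input layout (lists indexed by `K - 1`) are assumptions made for illustration.

```python
def select_dimension(log_joint, entropy):
    """Select K by exact BIC and by ICL.

    log_joint[K-1] = log P(Y, K) and entropy[K-1] = H(m | Y, K).
    Returns (K_bic, K_icl, bic, icl); both criteria are minimised.
    """
    bic = [-lj for lj in log_joint]
    icl = [b + h for b, h in zip(bic, entropy)]
    k_bic = min(range(len(bic)), key=bic.__getitem__) + 1
    k_icl = min(range(len(icl)), key=icl.__getitem__) + 1
    return k_bic, k_icl, bic, icl
```

Because the entropy is non-negative, ICL is never smaller than BIC for a given dimension, and the two selections differ when the dimension preferred by BIC has a diffuse posterior over segmentations.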

3.3 Comparison with other penalized criteria

Many model selection criteria have the following form:

$$\log P(Y|\hat\theta, m) - \mathrm{pen}(m)$$

and use a two-step strategy. Interestingly, since the penalty generally depends only on the dimension
(Lebarbier (2005), Lavielle (2005)), the best segmentation $\hat{m}(K)$ does not actually depend on the penalty.

The calculation of the exact BIC does not provide any explicit penalty enabling a direct comparison
with such criteria. For such comparison, we derive two approximations of $\log P(Y^r) = \log \int P(Y^r|\theta_r)P(\theta_r)\,d\theta_r$
in the heteroscedastic Gaussian case. The first one is based on a Laplace approximation:

$$\log P(Y^r) \approx \log P(Y^r|\hat\theta_r) - \frac{D}{2}\log n_r,$$

where $D$ stands for the number of parameters involved in each segment (here, $D = 2$). This approximation
is valid only for large segments, i.e. where $P(Y^r|\theta_r)$ satisfies regularity conditions. For the second
approximation, we let the hyperparameters $n_0$, $\nu_0$ and $s_0$ go to 0 in (8) and we obtain

$$\log P(Y^r) \approx -\frac{n_r}{2}\log S_r^2 - \frac{D}{2}\log n_r \approx \log P(Y^r|\hat\theta_r) - \frac{D}{2}\log n_r.$$

We emphasize that these approximations are both questionable since the asymptotic framework of the
Laplace approximation is not correct for small segments and because the priors are improper for null
hyperparameters. Our purpose is only to show that they both provide the same penalty form:

$$\log P(m|Y) \approx \log P(m) + \log P(Y|\hat\theta, m) - \frac{D}{2}\sum_{r \in m}\log n_r.$$
Using uniform prior (4), we get

$$-\mathrm{pen}(m) = \log P(K(m)) - \log\binom{n-1}{K(m)-1} - \frac{D}{2}\sum_{r \in m}\log n_r.$$

A similar form is obtained in the Poisson case. The complexity term, $\log\binom{n-1}{K-1}$, is similar to the one of
Lebarbier (2005). The regularity term, $\sum_{r \in m}\log n_r$, favours segments with equal lengths and is similar
to the one of Zhang and Siegmund (2007). Using the alternative prior (5) reinforces the regularity term.
Due to this term, the best segmentation $\hat{m}(K)$ within $\mathcal{M}_K$ does depend on the penalty.



4 Applications 

In this section, we first present a simulation study to assess the ability of the exact BIC and ICL criteria
to select the dimension and the ability of model averaging to retrieve the mean signal. We then analyse
a real CGH profile and use our formulae to assess the quality of the segmentation.

4.1 Simulations 

Simulation design. We performed the simulation study in the Poisson model (6) so that only one
parameter had to be chosen. We simulated a sequence of 150 observations affected by six change-points
at the following positions: 21, 29, 68, 82, 115, 135. Odd segments had a mean of 1, while even segments
had a mean of $1 + \lambda$, where $\lambda$ varies from 0 to 10. The higher $\lambda$ is, the easier it should be to recover
the true number of change-points. The hyperparameters $\alpha$ and $\beta$ were set to be equal and we considered
three values for them: 0.01, 0.1 and 1. For each configuration, we simulated 300 sequences.
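The simulation design can be reproduced along the following lines. This is a sketch, not the authors' script: it assumes that a change-point at position t means that a new segment starts at t (consistent with the convention on $\tau_k$), that positions are 1-based, and that NumPy's random generator is an acceptable source of Poisson draws. The function names and the seed are arbitrary.

```python
import numpy as np

CHANGE_POINTS = [21, 29, 68, 82, 115, 135]

def true_mean_signal(lam, n=150, change_points=CHANGE_POINTS):
    """Piecewise constant mean: 1 on odd segments, 1 + lam on even segments."""
    bounds = [1] + list(change_points) + [n + 1]
    mu = np.empty(n)
    for k, (start, stop) in enumerate(zip(bounds[:-1], bounds[1:]), start=1):
        # segment k covers the 1-based positions start, ..., stop - 1
        mu[start - 1:stop - 1] = 1.0 if k % 2 == 1 else 1.0 + lam
    return mu

def simulate(lam, n_rep=300, seed=0):
    """Draw n_rep independent Poisson sequences sharing the same mean signal."""
    rng = np.random.default_rng(seed)
    mu = true_mean_signal(lam)
    return rng.poisson(mu, size=(n_rep, mu.size))
```

Each row of the returned array is one simulated sequence of 150 counts with seven segments.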
Figure 1: Percentage of true dimension recoveries as a function of $\lambda$. Left panel: for the three criteria,
$\mathrm{BIC}(\hat{m}_K)$: ■, $\mathrm{BIC}(K)$: • and $\mathrm{ICL}(K)$: ▲. Right panel: for the BIC criteria; •: uniform
prior over all segmentations, ■: uniform prior over all segmentations of a dimension; line styles
distinguish $\alpha = \beta = 1$ (solid), $\alpha = \beta = 0.1$ and $\alpha = \beta = 0.01$ (dotted).



4.2 Recovering the number of change-points 

4.2.1 The ICL criterion performed better than the BIC criterion 

Model selection. The BIC criterion for dimension selection, $\mathrm{BIC}(K)$, almost never returned the true
dimension, even for high values of $\lambda$ (Figure 1, where $\alpha$ and $\beta$ were set to 1). On the other hand, both
the BIC criterion for model selection, $\mathrm{BIC}(m)$, and the ICL criterion, $\mathrm{ICL}(K)$, tended to recover the true
dimension more often when $\lambda$ became larger. $\mathrm{ICL}(K)$ even increased to a maximum of 99% true recoveries
compared to a maximum of 91% for the $\mathrm{BIC}(m)$ criterion for model selection.

Influence of the priors. The ability of $\mathrm{BIC}(m)$ to retrieve the true dimension was greatly affected
by the prior distribution of the segmentation (Figure 1). To illustrate this effect, we considered a prior
that gave equal probability to all segmentations, whatever their dimension: $P(m) = \mathrm{cst}$. This led to a
90% decrease in the ability to return the true dimension compared to a conditional uniform prior given
the dimension (4) (with $P(K(m)) = \mathrm{cst}$ whatever $m$). The impact of the two hyperparameters $\alpha$ and
$\beta$ seemed relatively limited in comparison: less than 10% difference in the ability to return the true
dimension (Figure 1).

Estimation of the mean signal. We then compared the ability of the maximum likelihood estimators 
(MLE) and that of the posterior mean signal to recover the true signal in terms of the Kullback-Leibler 
distance. For each simulation, we computed the following: 

t 

for both the MLE estimate $\hat\mu = \hat\mu_{\mathrm{MLE}}$ and the posterior mean $\hat\mu = s_K(t)$ (see equation (10)).

When $K$ was lower than the true dimension (7 segments), the two estimates were almost equivalent
(Figure 2). However, for larger dimensions, the distance of the MLE to the true signal increased whereas
the distance of the posterior mean did not (Figure 2). The posterior mean seemed less prone to over-fitting.
Moreover, for a very small signal-to-noise ratio ($\lambda = 1$), the distance between the posterior mean of the
signal and the true signal still decreased when $K$ was higher than the true dimension. Therefore, when
the signal was of poor quality and led to a poor assessment of the true dimension, the posterior mean of
the signal led to better results. Moreover, the standard deviation of $d$ for the posterior mean was almost
always smaller than that of the MLE (not shown).
Figure 2: Kullback-Leibler-based distance $d$ to the true signal as a function of the dimension, for the MLE,
$d(\hat\mu_{\mathrm{MLE}}, \mu)$ (■), and for the posterior mean, $d(\hat\mu, \mu)$, for three values of $\lambda$: 1, 2 and 6 (one line style each). The true number of segments was 7.




Figure 3: Left panel: Chromosome 10 profile of cell line BT474. The DNA copy number logratio is
represented as a function of its position along the chromosome. Right panel: (Left axis) $\mathrm{BIC}(m)$,
$\mathrm{BIC}(K)$: • and $\mathrm{ICL}(K)$: ■ as a function of the dimension. (Right axis) $\mathcal{H}(K) - \mathcal{H}(K-1)$: ○ as a function
of the dimension.



4.3 Analysis of a CGH profile 

In the following subsection, we used a comparative genomic hybridization (CGH) profile to illustrate
our methodology. CGH enables the study of DNA copy number gains and losses along the genome
(Pinkel et al. (1998)). We used the Gaussian segmentation model defined in (7) that is often used for
this type of data (Picard et al. (2005)). The profile shown in Figure 3 represents the copy number logratio
of cell line BT474 to a normal reference sample, along chromosome 10.

Model selection. Since the true dimension was unknown, the first issue was to choose one. The
$\mathrm{ICL}(K)$ criterion selected 4 segments whereas $\mathrm{BIC}(m)$ selected a segmentation with 3 segments (Figure 3).
The additional penalty term involved in ICL does not necessarily penalise larger dimensions. In our
example, ICL selected a segmentation with a larger dimension because it was more reliable. The choice
of ICL was motivated by the relatively small gain of entropy between dimensions 3 and 4. This choice
was also supported by the posterior distributions of the change-points and that of the segments shown
below. The best segmentations for 3 and 4 segments are shown in Figure 4 (i).
Posterior probability of the change-point positions. The distributions of the successive change-points
for dimensions 3 and 4 are shown in Figure 4 (ii). For dimension 3, the exact intervals with
credibility 95% were $[\![64, 78]\!]$ and $[\![92, 97]\!]$ for $\tau_2$ and $\tau_3$, respectively. For dimension 4, the intervals were
$[\![66, 78]\!]$, $[\![78, 97]\!]$ and $[\![91, 112]\!]$ for $\tau_2$, $\tau_3$ and $\tau_4$, respectively.

The existence of a change-point at a given position $t$ is assessed by posterior probability $B_K(t)$. Note
that, contrary to $B_{K,k}(t)$, $B_K(t)$ is not a probability distribution over the positions, because its sum is
the number of change-points: $K - 1$. In our example, the posterior probabilities $B_4(t)$ presented sharper
peaks than $B_3(t)$ (see Figure 4 (iii)), which was consistent with the choice of the ICL criterion that
favours reliable segmentations.

Posterior probability of a segment. Similar conclusions were drawn from the posterior probability
of the segments. In Figure 4 (iv) each point corresponds to a segment. A reliable dimension should
display $K$ sharp peaks. The positions of the first two segments were very uncertain for $K = 3$, due to
the uncertainty of $\tau_2$. Their positions were much more certain with $K = 4$. In particular, the smallest
segment from $K = 4$ at positions [78, 79] had a relatively high probability of 0.34.

Posterior mean of the signal. Similarly, the posterior mean for 3 segments was different from the
one for 4 segments (Figure 5); the former failed to capture the small deletion at [78, 79]. As soon as
$K$ exceeded 4, the posterior mean of the signal was very stable, see the example for $K = 5$ segments in
Figure 5.

All presented results show that the segmentation in 4 segments selected by $\mathrm{ICL}(K)$ is more reliable
than the segmentation in 3 segments selected by $\mathrm{BIC}(m)$.
A Lemma and Proofs

A.1 Proof of Theorem 2.1

The proof of the theorem relies on the following lemma. 

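Lemma A.1 Let $A$ be a square matrix with $n$ columns. For all $k \in \mathbb{N}$, we define the function $f_{A,k}$ as:

$$\forall (i,j) \in [\![1,n]\!]^2, \quad f_{A,k}(i,j) = \sum_{\substack{(t_2,\dots,t_k) \in [\![1,n]\!]^{k-1} \\ t_1 = i,\ t_{k+1} = j}} \ \prod_{\ell=1}^{k} A_{t_\ell, t_{\ell+1}}.$$

The $n$ elements of $\{f_{A,k}(i,j)\}_{j \in [\![1,n]\!]}$ for $1 \le k \le K$ can all be computed in $O(Kn^2)$ as

$$f_{A,k}(i,j) = (A^k)_{i,j}.$$

Proof of the Lemma. $f_{A,k}(i,j) = A_{i,j}$ holds for $k = 1$. Suppose that $f_{A,k}(i,j) = (A^k)_{i,j}$ holds for a
given $k \in \mathbb{N}$. For $k + 1$, we have:

$$f_{A,k+1}(i,j) = \sum_{t_1 = i,\ t_{k+2} = j} \ \prod_{\ell=1}^{k+1} A_{t_\ell,t_{\ell+1}} = \sum_{t=1}^{n} \Big( \sum_{t_1 = i,\ t_{k+1} = t} \ \prod_{\ell=1}^{k} A_{t_\ell,t_{\ell+1}} \Big) A_{t,j} = \sum_{t=1}^{n} f_{A,k}(i,t)\, A_{t,j}.$$

Using our induction hypothesis and by definition of the matrix product, we obtain:

$$f_{A,k+1}(i,j) = \sum_{t=1}^{n} (A^k)_{i,t}\, A_{t,j} = (A^{k+1})_{i,j}.$$

Thus, the $K \times n$ elements of the form

$$\{f_{A,k}(t_1, t_{k+1})\}_{k \in [\![1,K]\!],\ t_{k+1} \in [\![1,n]\!]}$$

can be computed in $O(Kn^2)$ as the $t_1$-th line of matrices $A, \dots, A^K$ respectively. ■

Proof of the Theorem. For any $(t_1, \dots, t_{k+1})$ in $[\![1,n+1]\!]^{k+1}$ such that we do not have
$t_1 < t_2 < \dots < t_{k+1}$, $\prod_{i=1}^{k} A_{t_i,t_{i+1}} = 0$. Therefore, for all $k \in [\![1,K]\!]$ and for all $j$ in $[\![1,n+1]\!]$, with $t_1 = 1$ and $t_{k+1} = j$:

$$\sum_{m \in \mathcal{M}_k([\![1,j[\![)} F(m) = \sum_{t_1 < t_2 < \dots < t_{k+1}} \ \prod_{i=1}^{k} A_{t_i,t_{i+1}} = \sum_{(t_2,\dots,t_k) \in [\![1,n+1]\!]^{k-1}} \ \prod_{i=1}^{k} A_{t_i,t_{i+1}}.$$

Using Lemma A.1 on matrix $A$ and integer $K$, we see that the $K \times (n+1)$ terms of the form

$$\Big\{ \sum_{m \in \mathcal{M}_k([\![1,j[\![)} F(m) \Big\}_{k \in [\![1,K]\!],\ j \in [\![1,n+1]\!]}$$

can be computed as $\sum_{m \in \mathcal{M}_k([\![1,j[\![)} F(m) = (A^k)_{1,j}$ and that therefore they can all be computed in
$O(Kn^2)$ as the first line of the successive powers of matrix $A$.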

A.2 Proof of Proposition 2.3

Proof. We first consider the posterior distribution of the change-points. With Equation (9), we obtain

$$B_{K,k}(t) = \frac{\sum_{m \in \mathcal{B}_{K,k}(t)} P(Y|m)\,P(m|K)}{P(Y|K)} = \frac{F_{1,t}(k-1)\,F_{t,n+1}(K-k+1)}{P(Y|K)}.$$

Using Theorem 2.1 we see that all the $F$ functions can be computed in $O(Kn^2)$. $O(K^2 n)$ products
and divisions remain to be done to compute all $B_{K,k}(t)$, so the overall complexity is in $O(Kn^2)$. The
probability $B_K(t)$ follows straightforwardly.

We now consider the posterior distribution of the segments. We first note that if $t_1 = 1$, then
$S_{K,1}(1, t_2) = B_{K,2}(t_2)$. Similarly, when $t_2 = n + 1$, we have $S_{K,K}(t_1, t_2) = B_{K,K}(t_1)$. So we only have to
consider the case where $1 < t_1 < t_2 < n + 1$. Since $\mathcal{S}_{K,k}([\![t_1,t_2[\![)$ can be decomposed as

$$\mathcal{S}_{K,k}([\![t_1,t_2[\![) = \mathcal{M}_{k-1}([\![1,t_1[\![) \times \{[\![t_1,t_2[\![\} \times \mathcal{M}_{K-k}([\![t_2,n+1[\![),$$

we have

$$S_{K,k}(t_1,t_2) = \frac{\sum_{m \in \mathcal{S}_{K,k}([\![t_1,t_2[\![)} P(Y|m)\,P(m|K)}{P(Y|K)} = \frac{F_{1,t_1}(k-1)\,F_{t_1,t_2}(1)\,F_{t_2,n+1}(K-k)}{P(Y|K)}.$$

Again using Theorem 2.1 we see that all the $F$ functions can be computed in $O(Kn^2)$. We then need to
compute $O(n^2)$ products and divisions to get the $S_{K,k}(t_1,t_2)$, thus the overall complexity is in $O(Kn^2)$.
The last probability comes from the definition of $S_K(t_1,t_2)$. $O(Kn^2)$ additions remain to be done; the
overall complexity is therefore in $O(Kn^2)$. ■
Figure 4: (i): Best segmentation of the profile in 3 (left) and 4 (right) segments. •: logratio
as a function of the position along the chromosome. Solid line: averaged signal of the segment. Dotted line: change-point
positions. (ii): Posterior probability that the $k$-th change-point is at position $t$ knowing that there are
either 3 (left) or 4 (right) segments, with one line style for each of the first, second and third change-points.
(iii): Posterior probability that there
is a change-point at position $t$ knowing that there are 3 (right) or 4 (left) segments. (iv): 3D plot of the
probability of all segments. Left panel: $K = 3$ segments; right panel: $K = 4$ segments. x-axis: $t_1$, y-axis:
$t_2$, z-axis: $S_K([\![t_1,t_2[\![)$.
Figure 5: Posterior mean of the signal; Left: $K = 3$ segments; Center: $K = 4$ segments; Right: $K = 5$
segments. •: logratio as a function of the position along the chromosome. Solid line: posterior mean of the
signal. Dotted line: change-point positions of the best segmentation.